 full law


Recursive Equations For Imputation Of Missing Not At Random Data With Sparse Pattern Support

Phung, Trung, Reese, Kyle, Shpitser, Ilya, Bhattacharya, Rohit

arXiv.org Artificial Intelligence

A common approach for handling missing values in data analysis pipelines is multiple imputation via software packages such as MICE (Van Buuren and Groothuis-Oudshoorn, 2011) and Amelia (Honaker et al., 2011). These packages typically assume the data are missing at random (MAR), and impose parametric or smoothing assumptions upon the imputing distributions in a way that allows imputation to proceed even if not all missingness patterns have support in the data. Such assumptions are unrealistic in practice, and induce model misspecification bias on any analysis performed after such imputation. In this paper, we provide a principled alternative. Specifically, we develop a new characterization for the full data law in graphical models of missing data. This characterization is constructive, is easily adapted for the calculation of imputation distributions for both MAR and MNAR (missing not at random) mechanisms, and is able to handle lack of support for certain patterns of missingness. We use this characterization to develop a new imputation algorithm -- Multivariate Imputation via Supported Pattern Recursion (MISPR) -- which uses Gibbs sampling, by analogy with the Multivariate Imputation with Chained Equations (MICE) algorithm, but which is consistent under both MAR and MNAR settings, and is able to handle missing data patterns with no support without imposing additional assumptions beyond those already imposed by the missing data model itself. In simulations, we show MISPR obtains comparable results to MICE when data are MAR, and superior, less biased results when data are MNAR. Our characterization and imputation algorithm based on it are a step towards making principled missing data methods more practical in applied settings, where the data are likely both MNAR and sufficiently high dimensional to yield missing data patterns with no support at available sample sizes.


Review for NeurIPS paper: A Robust Functional EM Algorithm for Incomplete Panel Count Data

Neural Information Processing Systems

Weaknesses: - The MCAR assumption is difficult to justify in practice. This is good; however, could the authors clarify some of the following points regarding their method in the context of MCAR missingness? By definition, MCAR implies that one can simply ignore any rows of data containing missing values: restricting the analysis to so-called "complete cases" will still result in unbiased estimates of the parameter of interest. In light of this, and of the bounds on \epsilon implying that there will always be complete cases in the data as n \to \infty (if this were not true, the parameters of interest would not be identifiable), what is the advantage of the proposed EM algorithm over simply doing complete-case analysis and using some of the older tools cited in the paper that can be run on complete data? I apologize if I missed this, but there does not seem to be a baseline comparison to such a complete-case analysis, or to the alternative of directly maximizing the observed-data likelihood by integrating according to patterns of missingness.
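The point about complete-case analysis can be illustrated with a small simulation. This is a minimal sketch and not part of the reviewed paper: the function name `complete_case_means`, the standard normal variable, the 30% MCAR rate, and the logistic MNAR mechanism are all illustrative assumptions. It estimates a mean (true value 0) from complete cases only, once under MCAR and once under an MNAR mechanism in which larger values are more likely to be missing.

```python
import numpy as np


def complete_case_means(n=100_000, seed=0):
    """Return (MCAR complete-case mean, MNAR complete-case mean) for X ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=0.0, scale=1.0, size=n)  # true mean is 0

    # MCAR: each value is missing with probability 0.3, independent of x.
    observed_mcar = rng.random(n) > 0.3

    # MNAR (assumed mechanism): P(missing | x) = sigmoid(2x),
    # so large values of x are more likely to be missing.
    p_missing = 1.0 / (1.0 + np.exp(-2.0 * x))
    observed_mnar = rng.random(n) > p_missing

    return x[observed_mcar].mean(), x[observed_mnar].mean()


mcar_mean, mnar_mean = complete_case_means()
print(f"complete-case mean under MCAR: {mcar_mean:.3f}")
print(f"complete-case mean under MNAR: {mnar_mean:.3f}")
```

Under MCAR the complete-case mean should stay close to the true value, whereas under the assumed MNAR mechanism it is pulled clearly below zero, which is why a complete-case baseline is informative only in the MCAR setting.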


Zero Inflation as a Missing Data Problem: a Proxy-based Approach

Phung, Trung, Lee, Jaron J. R., Oladapo-Shittu, Opeyemi, Klein, Eili Y., Gurses, Ayse Pinar, Hannum, Susan M., Weems, Kimberly, Marsteller, Jill A., Cosgrove, Sara E., Keller, Sara C., Shpitser, Ilya

arXiv.org Artificial Intelligence

A common type of zero-inflated data has certain true values incorrectly replaced by zeros due to data recording conventions (rare outcomes assumed to be absent) or details of data recording equipment (e.g., artificial zeros in gene expression data). Existing methods for zero-inflated data either fit the observed data likelihood via parametric mixture models that explicitly represent excess zeros, or aim to replace excess zeros by imputed values. If the goal of the analysis relies on knowing true data realizations, a particular challenge with zero-inflated data is identifiability, since it is difficult to correctly determine which observed zeros are real and which are inflated. This paper views zero-inflated data as a general type of missing data problem, where the observability indicator for a potentially censored variable is itself unobserved whenever a zero is recorded. We show that, without additional assumptions, target parameters involving a zero-inflated variable are not identified. However, if a proxy of the missingness indicator is observed, a modification of the effect restoration approach of Kuroki and Pearl allows identification and estimation, given the proxy-indicator relationship is known. If this relationship is unknown, our approach yields a partial identification strategy for sensitivity analysis. Specifically, we show that only certain proxy-indicator relationships are compatible with the observed data distribution. We give an analytic bound for this relationship in cases with a categorical outcome, which is sharp in certain models. For more complex cases, sharp numerical bounds may be computed using methods in Duarte et al. [2023]. We illustrate our method via simulation studies and a data application on central line-associated bloodstream infections (CLABSIs).
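The role of a known proxy-indicator relationship can be sketched with the generic matrix-inversion idea behind effect restoration. This is a hedged illustration and not the paper's estimator: it assumes a binary unobserved indicator R, a binary proxy W that is conditionally independent of a categorical observed variable Y given R, and a known, invertible matrix P(W | R). The function name `restore_joint` and all numbers are hypothetical.

```python
import numpy as np


def restore_joint(p_y_w, p_w_given_r):
    """Recover P(Y, R) from P(Y, W) when P(W | R) is known.

    p_y_w[y, w]       = P(Y = y, W = w)   (observable)
    p_w_given_r[w, r] = P(W = w | R = r)  (assumed known and invertible)

    Assuming W is independent of Y given R:
        P(Y = y, W = w) = sum_r P(Y = y, R = r) * P(W = w | R = r),
    i.e. P(Y, W) = P(Y, R) @ P(W | R)^T, so P(Y, R) = P(Y, W) @ inv(P(W | R)^T).
    """
    return p_y_w @ np.linalg.inv(p_w_given_r.T)


# Hypothetical ground truth, used only to check the round trip.
p_y_r_true = np.array([[0.10, 0.30],
                       [0.20, 0.40]])
p_w_given_r = np.array([[0.9, 0.2],   # P(W=0 | R=0), P(W=0 | R=1)
                        [0.1, 0.8]])  # P(W=1 | R=0), P(W=1 | R=1)

p_y_w = p_y_r_true @ p_w_given_r.T    # what would be observed
p_y_r_hat = restore_joint(p_y_w, p_w_given_r)
print(np.round(p_y_r_hat, 3))
```

When P(W | R) is not known, the inversion above cannot be carried out directly; the abstract's partial identification strategy addresses that case by restricting which proxy-indicator relationships are compatible with the observed data.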


Sufficient Identification Conditions and Semiparametric Estimation under Missing Not at Random Mechanisms

Guo, Anna, Zhao, Jiwei, Nabi, Razieh

arXiv.org Machine Learning

Conducting valid statistical analyses is challenging in the presence of missing-not-at-random (MNAR) data, where the missingness mechanism is dependent on the missing values themselves even conditioned on the observed data. Here, we consider an MNAR model that generalizes several prior popular MNAR models in two ways: first, it is less restrictive in terms of statistical independence assumptions imposed on the underlying joint data distribution, and second, it allows for all variables in the observed sample to have missing values. This MNAR model corresponds to a so-called criss-cross structure considered in the literature on graphical models of missing data that prevents nonparametric identification of the entire missing data model. Nonetheless, part of the complete-data distribution remains nonparametrically identifiable. By exploiting this fact and considering a rich class of exponential family distributions, we establish sufficient conditions for identification of the complete-data distribution as well as the entire missingness mechanism. We then propose methods for testing the independence restrictions encoded in such models using the odds ratio as our parameter of interest. We adopt two semiparametric approaches for estimating the odds ratio parameter and establish the corresponding asymptotic theories: one involves maximizing a conditional likelihood with order statistics and the other uses estimating equations. The utility of our methods is illustrated via simulation studies.


On Testability and Goodness of Fit Tests in Missing Data Models

Nabi, Razieh, Bhattacharya, Rohit

arXiv.org Artificial Intelligence

Significant progress has been made in developing identification and estimation techniques for missing data problems where modeling assumptions can be described via a directed acyclic graph. The validity of results using such techniques relies on the assumptions encoded by the graph holding true; however, verification of these assumptions has not received sufficient attention in prior work. In this paper, we provide new insights on the testable implications of three broad classes of missing data graphical models, and design goodness-of-fit tests for them. The classes of models explored are: sequential missing-at-random and missing-not-at-random models, which can be used for modeling longitudinal studies with dropout/censoring, and a no self-censoring model, which can be applied to cross-sectional studies and surveys.